8 research outputs found

    Abordagens de aprendizagem não supervisionada para fluxos de dados não estacionários

    Modern society is surrounded by applications that generate large volumes of data every day. Anyone can now monitor their physical activity in real time using a smartphone or wearable device, and businesses and governments can learn more about their clients and citizens by analysing information from social media, for example. Such data is called a data stream when it is a sequence generated continuously, usually at high speed. A data stream is also potentially unbounded in size and may not be strictly stationary. Extracting useful knowledge from data streams is challenging because of several constraints. A learning algorithm for data streams must operate in dynamic environments, which means it should allow for real-time processing. Moreover, it should be able to adapt to changes over time, given the non-stationary nature of the data stream. In the last few decades, many machine learning approaches have been proposed for data streams. Most of them are based on supervised learning and rely on labelled data to adapt their models to changes in the stream. However, labelling data is usually costly and can require domain expertise. Besides, if the data is collected at high speed, there may not be enough time to label it. In this thesis, we propose unsupervised and incremental machine learning algorithms for data streams. We focus on algorithms able to update their classification model with little or no external feedback. We start by addressing the problem of concept drift in data streams with few labelled data. For that problem, we propose a semi-supervised approach called Sliding Window Clusters. This method learns the current patterns from the data stream by selecting and summarising the most relevant data. We also study how to learn from data streams when novelties appear over time, and propose an unsupervised learning method called Higia, which is able to classify data as normal, novelty or concept drift. As a third contribution, we propose an approach to combine different unsupervised approaches into a classification model, and we test it in two scenarios. The first, called Homogeneous Ensemble Clustering for Data Streams, is based on the combination of different runs of the same clustering algorithm. The second, called Heterogeneous Ensemble Clustering for Data Streams, is based on the combination of different clustering algorithms. These methods allow clustering approaches with different biases to be used together to obtain a more robust classification model. Furthermore, we evaluate the state-of-the-art approaches commonly referred to in the literature on novelty detection in data streams. Most of this thesis focuses on clustering approaches. However, given the popularity of neural networks, we also propose the Ensemble of Auto-Encoders. This approach is based on the combination of auto-encoders into an ensemble model, in which each auto-encoder is specialised in recognising one particular class. The Ensemble of Auto-Encoders has a modular structure, which makes the model easy to adapt to changes in the data. It also allows for personalised models, because the model can adapt to the most frequent classes. This contribution is applied to the problem of Human Activity Recognition. Experimental results show the potential of the approaches mentioned.

    Online Clustering for Novelty Detection and Concept Drift in Data Streams

    Data streams are large amounts of data that arrive continuously, with a probability distribution that may change over time. Depending on the changes in the data distribution, different phenomena can occur, such as the appearance of new classes or concept drift in existing classes. Machine Learning algorithms have often been used to model this data. New classes are patterns that were not seen during the training of the current classification model, but appear after some time. Concept drift occurs when the concepts associated with a dataset change as new data arrive. This paper proposes a new algorithm based on kNN that uses micro-clusters as prototypes and incrementally updates the micro-clusters or creates new micro-clusters when novelties are detected. In the online phase, each instance close to a micro-cluster is considered an extension of the micro-cluster and is used to adapt the model to concept drift. The proposed algorithm is experimentally compared with a state-of-the-art classifier from the data stream literature and one baseline. According to the experimental results, the proposed algorithm increases the predictive performance over time by incrementally learning changes in the data distribution.
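    The idea of nearest-neighbour classification over micro-cluster prototypes can be illustrated with a minimal sketch. This is not the paper's algorithm: the class names `MicroCluster` and `MicroClusterNN`, the use of a single nearest prototype (k = 1), the `min_radius` threshold and the `novelty-N` labels are all assumptions made for illustration.

```python
import math


class MicroCluster:
    """Prototype summarised by a count, a linear sum and a squared sum."""

    def __init__(self, point, label):
        self.n = 1
        self.ls = list(point)              # per-dimension linear sum
        self.ss = [x * x for x in point]   # per-dimension squared sum
        self.label = label

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # Square root of the summed per-dimension variance (0.0 for one point).
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ls, ss in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))

    def absorb(self, point):
        # Incremental update: the raw point does not need to be stored.
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x


class MicroClusterNN:
    def __init__(self, min_radius=1.0):
        self.min_radius = min_radius  # assumed floor for the "close" threshold
        self.clusters = []
        self.novel_count = 0

    def _is_close(self, point, cluster):
        limit = max(cluster.radius(), self.min_radius)
        return math.dist(point, cluster.centroid()) <= limit

    def _nearest(self, point, candidates):
        return min(candidates,
                   key=lambda c: math.dist(point, c.centroid()),
                   default=None)

    def fit(self, points, labels):
        # Offline phase: summarise labelled data into micro-clusters.
        for point, label in zip(points, labels):
            same = [c for c in self.clusters if c.label == label]
            nearest = self._nearest(point, same)
            if nearest is not None and self._is_close(point, nearest):
                nearest.absorb(point)
            else:
                self.clusters.append(MicroCluster(point, label))

    def predict_and_update(self, point):
        # Online phase: a close instance extends its micro-cluster (drift
        # adaptation); a distant one starts a new "novelty" micro-cluster.
        nearest = self._nearest(point, self.clusters)
        if nearest is not None and self._is_close(point, nearest):
            nearest.absorb(point)
            return nearest.label
        self.novel_count += 1
        label = f"novelty-{self.novel_count}"
        self.clusters.append(MicroCluster(point, label))
        return label


model = MicroClusterNN(min_radius=1.0)
model.fit([[0, 0], [0.2, 0.1], [5, 5], [5.1, 4.9]], ["a", "a", "b", "b"])
print(model.predict_and_update([0.1, 0.3]))   # close to the "a" prototype
print(model.predict_and_update([20, 20]))     # far from every prototype
```

    Summarising each prototype by sufficient statistics keeps memory bounded by the number of micro-clusters rather than the number of instances seen, which is the property that makes this kind of model usable on an unbounded stream.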

    An Ensemble of Autonomous Auto-Encoders for Human Activity Recognition

    Human Activity Recognition focuses on the use of sensing technology to classify human activities and to infer human behavior. While traditional machine learning approaches use hand-crafted features to train their models, recent advancements in neural networks allow for automatic feature extraction. Auto-encoders are a type of neural network that can learn complex representations of the data and are commonly used for anomaly detection. In this work we propose a novel multi-class algorithm which consists of an ensemble of auto-encoders where each auto-encoder is associated with a unique class. We compared the proposed approach with other state-of-the-art approaches in the context of human activity recognition. Experimental results show that ensembles of auto-encoders can be efficient, robust and competitive. Moreover, this modular classifier structure allows for more flexible models. For example, the number of classes can be extended by including new auto-encoders, without the need to retrain the whole model.
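    The modular structure described above can be sketched as one model per class, with new classes added independently. The sketch below makes two assumptions that the abstract does not state: that an instance is assigned to the class whose model reconstructs it with the lowest error, and that a trivial `MeanReconstructor` (which always outputs the class mean) can stand in for a trained auto-encoder. All names are hypothetical.

```python
import math


class MeanReconstructor:
    """Placeholder for a trained auto-encoder (an assumption, not a neural
    network): it 'reconstructs' every input as the mean of its class."""

    def __init__(self, samples):
        n = len(samples)
        self.mean = [sum(column) / n for column in zip(*samples)]

    def reconstruct(self, x):
        return self.mean


class ReconstructionEnsemble:
    """One reconstruction model per class; prediction picks the class whose
    model yields the lowest reconstruction error (assumed decision rule)."""

    def __init__(self):
        self.models = {}

    def add_class(self, label, samples, model_factory=MeanReconstructor):
        # Only the model for `label` is trained; existing models are untouched.
        self.models[label] = model_factory(samples)

    def errors(self, x):
        return {label: math.dist(x, model.reconstruct(x))
                for label, model in self.models.items()}

    def predict(self, x):
        errs = self.errors(x)
        return min(errs, key=errs.get)


ensemble = ReconstructionEnsemble()
ensemble.add_class("walking", [[1.0, 1.0], [1.2, 0.8]])
ensemble.add_class("sitting", [[0.0, 0.1], [0.1, 0.0]])
print(ensemble.predict([1.1, 1.0]))

# Extending the set of classes later does not retrain the earlier models.
ensemble.add_class("running", [[3.0, 3.1], [3.2, 2.9]])
print(ensemble.predict([3.0, 3.0]))
```

    Any object with a `reconstruct` method could be passed as `model_factory`, so a real auto-encoder could replace the placeholder without changing the ensemble logic.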

    An Efficient Scheme for Prototyping kNN in the Context of Real-Time Human Activity Recognition

    The kNN classifier is widely used in Human Activity Recognition (HAR) systems. Research efforts have proposed methods to decrease the high computational cost of the original kNN by focusing, e.g., on approximate kNN solutions such as the ones relying on Locality-sensitive Hashing (LSH). However, embedded kNN implementations also need to address the memory constraints of the target device and to save power and energy. One important aspect is the constraint on the maximum number of instances stored in the kNN learning process, whether that process is offline or online and incremental. This paper presents simple, energy- and computationally efficient, real-time feasible schemes to maintain a maximum number of learning instances stored by kNN. Experiments in the context of HAR show the efficiency of our best approaches and their ability to prevent the kNN storage from running out of training instances for a given activity, a situation not prevented by typical default schemes.
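    To make the storage constraint concrete, here is a minimal sketch of a kNN store with a fixed capacity. The eviction policy shown (drop the oldest instance of the currently largest class) is an assumption chosen for illustration, not one of the schemes from the paper; the class name `BoundedKNN` and its parameters are likewise hypothetical.

```python
import math
from collections import Counter


class BoundedKNN:
    """kNN whose instance store never exceeds `max_instances`."""

    def __init__(self, k=3, max_instances=6):
        self.k = k
        self.max_instances = max_instances
        self.store = []  # (features, label) pairs, oldest first

    def add(self, features, label):
        if len(self.store) >= self.max_instances:
            self._evict()
        self.store.append((list(features), label))

    def _evict(self):
        # Assumed policy: remove the oldest instance of the largest class.
        # While the capacity exceeds the number of classes, the largest class
        # holds at least two instances, so no class is emptied by eviction.
        counts = Counter(label for _, label in self.store)
        largest = max(counts, key=counts.get)
        for i, (_, label) in enumerate(self.store):
            if label == largest:
                del self.store[i]
                return

    def predict(self, features):
        # Majority vote among the k nearest stored instances.
        nearest = sorted(self.store,
                         key=lambda item: math.dist(features, item[0]))
        votes = Counter(label for _, label in nearest[: self.k])
        return votes.most_common(1)[0][0]


knn = BoundedKNN(k=1, max_instances=4)
for features in ([0, 0], [0.1, 0]):
    knn.add(features, "sit")
for i in range(5):                      # a long run of a single activity
    knn.add([5 + 0.1 * i, 5], "walk")
print(Counter(label for _, label in knn.store))
```

    In this usage a plain oldest-first eviction would have discarded both "sit" instances during the run of "walk" instances, whereas the class-aware policy keeps at least one of them.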

    Ensemble Clustering for Novelty Detection in Data Streams

    In data streams, new classes can appear over time due to changes in the statistical distribution of the data. Consequently, models can become outdated, which requires the use of incremental learning algorithms capable of detecting and learning the changes over time. However, when a single classification model is used for novelty detection, there is a risk that its bias may not be suitable for new data distributions. A solution could be the combination of several models into an ensemble. Besides, because models can only be updated when labeled data arrives, we propose two unsupervised ensemble approaches: one combining clustering partitions obtained with the same clustering technique, and the other using different clustering techniques. We compare the performance of the proposed methods with well-known novelty detection algorithms. The methods were tested on datasets commonly used in the novelty detection literature. The experimental results show that the proposed ensembles have competitive performance for novelty detection in data streams.